Basics of Neural Networks - 2

These are my notes for Part 1 of Andrew Ng's Deep Learning specialization, Neural Networks and Deep Learning, covering the Week 2 lectures on neural network basics and the Week 2 programming assignment.

Vectorization

The point of vectorization is to eliminate explicit for loops from your code. In deep learning you often train on large datasets, so the efficiency of your program matters a great deal; otherwise you may wait a very long time for results.

In logistic regression you need to compute $z = w^{T}x + b$, where $w$ and $x$ are both $n_x$-dimensional vectors.

With a non-vectorized implementation, i.e. the traditional element-by-element computation, the pseudocode is:
$z = 0$
$\text{for } i \text{ in range}(n_x):$
$\quad z \,{+}{=}\, w[i] * x[i]$
$z \,{+}{=}\, b$

With a vectorized implementation, the Python code is a single line:

z = np.dot(w,x) + b

which is both clearer and faster.

You can time these two implementations yourself; the vectorized version is roughly 300 times faster. Imagine the difference between code that gives you a result in 1 minute and code that takes 5 hours.
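As a rough sketch of such a test (the exact numbers depend on your machine; the vector length n below is an arbitrary choice):

import time
import numpy as np

n = 1000000                     # arbitrary vector length for the test
w = np.random.rand(n)
x = np.random.rand(n)

# vectorized version
tic = time.time()
z = np.dot(w, x)
toc = time.time()
print("vectorized: %.2f ms" % (1000 * (toc - tic)))

# explicit for-loop version
tic = time.time()
z = 0
for i in range(n):
    z += w[i] * x[i]
toc = time.time()
print("for loop:   %.2f ms" % (1000 * (toc - tic)))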

To speed up deep learning computations you can use a GPU (Graphics Processing Unit). In fact, both GPUs and CPUs provide parallelization instructions, also known as SIMD (Single Instruction, Multiple Data): a single control unit drives multiple processing elements that apply the same operation to every element of a data set (a "data vector") at the same time, achieving spatial parallelism. numpy, the foundational library for data analysis and scientific computing in Python, provides many built-in functions that operate on whole arrays and take full advantage of this parallelism, which makes computation much faster. In deep learning, the rule of thumb is: avoid explicit for loops whenever you can.
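For example, elementwise operations that would otherwise need a loop can be written with numpy built-ins (a minimal illustration; v is just a random test vector):

import numpy as np

v = np.random.rand(1000)

# instead of looping over every element and applying exp / max / log to it:
u = np.exp(v)            # elementwise exponential, no explicit for loop
r = np.maximum(v, 0)     # elementwise maximum with 0, also loop-free
s = np.log(v)            # elementwise logarithm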

With vectorization we can now optimize the gradient-descent computation. First, remove the inner loop over the features $w_1, w_2, \ldots$ :

$J = 0;\ db = 0;\ dw = np.zeros((n_x, 1))$
$\text{for } i = 1 \text{ to } m:$
$\quad z^{(i)} = w^{T}x^{(i)}+b$
$\quad a^{(i)} = \sigma(z^{(i)})$
$\quad J \,{+}{=}\, -(y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}))$
$\quad dz^{(i)} = a^{(i)}-y^{(i)}$
$\quad dw \,{+}{=}\, x^{(i)}dz^{(i)} \quad \quad$ // vectorized over the features
$\quad db \,{+}{=}\, dz^{(i)}$
$J /= m$
$dw /= m$
$db /= m$
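As a hedged sketch, here is the same partially vectorized loop in actual Python, assuming sigmoid() is defined, X and Y hold the data with shapes (n_x, m) and (1, m), and the parameters w (shape (n_x, 1)) and b are already initialized:

import numpy as np

J = 0
db = 0
dw = np.zeros((n_x, 1))

for i in range(m):
    x_i = X[:, i].reshape(-1, 1)          # i-th training example, shape (n_x, 1)
    z_i = np.dot(w.T, x_i) + b
    a_i = sigmoid(z_i)
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
    dz_i = a_i - Y[0, i]
    dw += x_i * dz_i                      # vectorized over the n_x features
    db += dz_i

J /= m
dw /= m
db /= m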

Next, we remove the outer loop over the $m$ training examples. Let's look at forward propagation and backward propagation separately.

Forward propagation

Recall the forward-propagation steps of logistic regression. Suppose you have $m$ training examples.

To make a prediction on the first example, you compute
$\quad z^{(1)} = w^{T}x^{(1)} + b$
$\quad a^{(1)} = \sigma(z^{(1)})$

Then you do the same for the second example,
$\quad z^{(2)} = w^{T}x^{(2)} + b$
$\quad a^{(2)} = \sigma(z^{(2)})$

and then the third, the fourth, ..., all the way up to the $m$-th.

Recall from the earlier part on binary classification that the whole training set can be written compactly as the $(n_x, m)$ matrix $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$, whose columns are the training examples.

To compute $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$ all at once, stack them into a $(1, m)$ matrix $Z = [z^{(1)}, z^{(2)}, \ldots, z^{(m)}]$; then
$Z = [w^{T}x^{(1)} + b,\; w^{T}x^{(2)} + b,\; \ldots,\; w^{T}x^{(m)} + b] = w^{T}X + b$

In Python, a single line of code completes this whole computation:

Z = np.dot(w.T,X) + b

You might wonder why $b$, which is just a real number (or a $(1,1)$ matrix, if you like), can be added to the $(1,m)$ matrix $Z$. When this addition is evaluated, Python automatically expands $b$ into a $(1,m)$ row vector; in Python this is called broadcasting. For now it is enough to recognize what is happening; broadcasting is explained in more detail below.
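A minimal sketch of that expansion (the numbers are arbitrary):

import numpy as np

Z = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3)
b = 0.5                           # a scalar, i.e. a (1, 1) "matrix"
print(Z + b)                      # b is broadcast to [0.5, 0.5, 0.5] -> [[1.5, 2.5, 3.5]]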

In the same way we get $A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$,

which again takes a single line of Python:

A = sigmoid(Z)

Backward propagation

Next, let's see how vectorization speeds up backward propagation, i.e. computing the gradients.

Again, you need to compute
$dz^{(1)} = a^{(1)} - y^{(1)}$
$dz^{(2)} = a^{(2)} - y^{(2)}$
$\ldots$
all the way to the $m$-th example,

that is, the row vector $dZ = [dz^{(1)}, dz^{(2)}, \ldots, dz^{(m)}]$.

We already have $A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$,
and we define the vector of output labels $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$.

Then
$dZ = A - Y = [a^{(1)} - y^{(1)},\; a^{(2)} - y^{(2)},\; \ldots,\; a^{(m)} - y^{(m)}]$

With $dZ$ in hand we can compute $dw$ and $db$.
From the formulas derived earlier,
$dw = \frac{1}{m} X\, dZ^{T}$
$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$

The corresponding Python code is

db = np.sum(dZ)/m
dw = np.dot(X,dZ.T)/m

Finally, here is one full iteration of the vectorized gradient-descent algorithm:

import numpy as np
Z = np.dot(w.T,X) + b
A = sigmoid(Z)
dZ = A - Y
dw = np.dot(X,dZ.T)/m
db = np.sum(dZ)/m
w = w - alpha * dw
b = b - alpha * db

You might wonder why the cost function $J$ is no longer computed here. In my view, $J$ is simply the objective that the logistic regression model minimizes; its role is to give us $dw$ and $db$, which are all we need for the parameter update. In the assignment below, computing $J$ is still useful for analysing the model further.
With that, we have completed a single iteration of gradient descent for logistic regression. If you want to run many iterations, however, you still have to use an explicit for loop over them.
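As a hedged sketch of that remaining outer loop, assuming sigmoid(), the data X and Y, the number of examples m, the parameters w and b, a learning rate alpha, and num_iterations are already defined (this mirrors the optimize() function in the homework below):

import numpy as np

for it in range(num_iterations):      # the one for loop we cannot vectorize away
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db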

Broadcasting

Broadcasting exists mainly to let arrays of different shapes (think of them as matrices of different sizes) take part in the same arithmetic operation.

  • When a vector is combined with a scalar using +, -, * or /, as with the $b$ in $w^{T}x+b$ in logistic regression, the operation is applied between the scalar and every element of the vector; equivalently, you can think of the scalar as being copied $m \times n$ times into an $(m,n)$ matrix of the same shape.
  • When an $(m,n)$ matrix is added to (or subtracted by, multiplied by, divided by) a $(1,n)$ matrix, the $(1,n)$ matrix is copied $m$ times into an $(m,n)$ matrix before the elementwise operation.
  • Likewise, when an $(m,n)$ matrix is combined with an $(m,1)$ matrix, the $(m,1)$ matrix is copied $n$ times into an $(m,n)$ matrix before the elementwise operation.

In plain terms, numpy conceptually copies the smaller array until the two shapes match and then performs the operation elementwise; the three cases are illustrated in the short sketch below. One more tip: to make sure a computation does what you intend, it is a good habit to call reshape() explicitly to pin down array shapes.
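A minimal numpy sketch of the three cases above (shapes and values chosen arbitrarily):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3)

print(A + 100)                           # scalar: applied to every element
print(A + np.array([[10, 20, 30]]))      # (1, 3) row: conceptually copied down the 2 rows
print(A + np.array([[10], [20]]))        # (2, 1) column: conceptually copied across the 3 columns

v = np.random.rand(3)                    # shape (3,): a rank-1 array with ambiguous orientation
v = v.reshape(3, 1)                      # reshape() makes the intended (3, 1) shape explicit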

Homework

Here is the complete code:

# LogisticRegression.py
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy

from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset


# Sigmoid function
def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(z)
    """

    s = 1 / (1 + np.exp(-z))
    return s


# Initialize w and b
def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.

    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)

    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias)
    """

    w = np.zeros((dim, 1))  # (dim, 1) is the shape argument: a dim x 1 matrix of zeros
    b = 0

    assert (w.shape == (dim, 1))
    assert (isinstance(b, float) or isinstance(b, int))

    return w, b


# propagate: forward and backward propagation
def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b

    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """

    m = X.shape[1]

    # FORWARD PROPAGATION (FROM X TO COST)
    A = sigmoid(np.dot(w.T, X) + b)  # compute activation
    cost = -1 / m * (np.dot(Y, np.log(A).T) + np.dot(1 - Y, np.log(1 - A).T))  # compute cost

    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = 1 / m * np.dot(X, (A - Y).T)
    db = 1 / m * np.sum(A - Y)

    assert (dw.shape == w.shape)
    assert (db.dtype == float)
    cost = np.squeeze(cost)  # remove axes of length 1, e.g. cost = [[1]] becomes the scalar 1
    assert (cost.shape == ())

    grads = {"dw": dw, "db": db}

    return grads, cost


# Gradient descent
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost=False):
    """
    This function optimizes w and b by running a gradient descent algorithm

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps

    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.

    Tips:
    You basically need to write down two steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagate().
        2) Update the parameters using gradient descent rule for w and b.
    """

    costs = []

    for i in range(num_iterations):

        # Cost and gradient calculation
        grads, cost = propagate(w, b, X, Y)

        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]

        # update rule
        w = w - dw * learning_rate
        b = b - db * learning_rate

        # Record the costs
        if i % 100 == 0:
            costs.append(cost)

        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    params = {"w": w, "b": b}

    grads = {"dw": dw, "db": db}

    return params, grads, costs


# Use the learned logistic regression parameters to predict the labels
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)

    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''

    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)

    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    A = sigmoid(np.dot(w.T, X) + b)  # A.shape = (1, m)

    for i in range(A.shape[1]):

        # Convert probabilities A[0,i] to actual predictions p[0,i]
        if A[0, i] > 0.5:
            Y_prediction[0, i] = 1
        else:
            Y_prediction[0, i] = 0

    assert (Y_prediction.shape == (1, m))

    return Y_prediction


# Build the whole model
def model(X_train,
          Y_train,
          X_test,
          Y_test,
          num_iterations=2000,
          learning_rate=0.5,
          print_cost=False):
    """
    Builds the logistic regression model by calling the function you've implemented previously

    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to true to print the cost every 100 iterations

    Returns:
    d -- dictionary containing information about the model.
    """

    # initialize parameters with zeros
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations,
                                        learning_rate, print_cost)

    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]

    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test Errors
    print("train accuracy: {} %".format(
        100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(
        100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    d = {
        "costs": costs,
        "Y_prediction_test": Y_prediction_test,
        "Y_prediction_train": Y_prediction_train,
        "w": w,
        "b": b,
        "learning_rate": learning_rate,
        "num_iterations": num_iterations
    }

    return d


def main():
    train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

    m_train = train_set_x_orig.shape[0]
    m_test = test_set_x_orig.shape[0]
    num_px = train_set_x_orig.shape[2]

    train_set_x_flatten = train_set_x_orig.reshape(m_train, -1).T
    test_set_x_flatten = test_set_x_orig.reshape(m_test, -1).T
    train_set_x = train_set_x_flatten / 255.
    test_set_x = test_set_x_flatten / 255.

    # train model
    d = model(
        train_set_x,
        train_set_y,
        test_set_x,
        test_set_y,
        num_iterations=2000,
        learning_rate=0.005,
        print_cost=True)

    # Predict whether a single image contains a cat
    my_image = "a.jpg"  # change this to the name of your image file

    # We preprocess the image to fit your algorithm.
    fname = "images/" + my_image
    image = np.array(ndimage.imread(fname, flatten=False))
    my_image = scipy.misc.imresize(image, size=(num_px, num_px)).reshape((1, num_px * num_px * 3)).T
    my_image = my_image / 255.  # scale pixel values to [0, 1], matching the training preprocessing
    my_predicted_image = predict(d["w"], d["b"], my_image)

    print("y = " + str(np.squeeze(my_predicted_image)) + ", your algorithm predicts a \"" + classes[
        int(np.squeeze(my_predicted_image)),].decode("utf-8") + "\" picture.")

    plt.imshow(image)
    plt.show()


main()